5 Probability
The usual touchstone of whether what someone asserts is mere persuasion or at least a subjective conviction, i.e., firm belief, is betting. Often someone pronounces his propositions with such confident and inflexible defiance that he seems to have entirely laid aside all concern for error. A bet disconcerts him. Sometimes he reveals that he is persuaded enough for one ducat but not for ten. For he would happily bet one, but at ten he suddenly becomes aware of what he had not previously noticed, namely that it is quite possible that he has erred. -— Immanuel Kant, Critique of Pure Reason
The central tension, and opportunity, in data science is the interplay between the data and the science, between our empirical observations and the models which we use to understand them. Probability is the language we use to explore that interplay; it connects models to data, and data to models.
The only package we need in this chapter is tidyverse.
5.1 List-columns and map functions
Before learning about probability, we need to expand our collection of R tricks by understanding list-columns and map_* functions. Recall that a list is different from an atomic vector. In atomic vectors, each element of the vector has one value. Lists, however, can contain vectors, and even more complex objects, as elements.
## [[1]]
## [1] 4 16 9
##
## [[2]]
## [1] "A" "Z"
x is a list with two elements. The first element is a numeric vector of length 3. The second element is a character vector of length 2. We use [[]] to extract specific elements.
x[[1]]## [1] 4 16 9
There are a number of built-in R functions that output lists. For example, the ggplot objects you have been making store all of the plot information in lists. Any function that returns multiple values can be used to create a list output by wrapping that returned object with list().
## # A tibble: 1 x 1
## col_1
## <list>
## 1 <dbl [2]>
Notice this is a 1-by-1 tibble with one observation, which is a list of one element. Voila! You have just created a list-column.
If a function returns multiple values as a vector, like range() does, you must use list() as a wrapper if you want to create a list-column.
A list column is a column of your data which is a list rather than an atomic vector. As with stand-alone list objects, you can pipe to str() to examine the column.
## tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
## $ col_1:List of 1
## ..$ : num [1:2] -1.737 0.834
We can use map_* functions to both create a list-column and then, much more importantly, work with that list-column afterwards.
## tibble [1 × 2] (S3: tbl_df/tbl/data.frame)
## $ col_1:List of 1
## ..$ : num [1:2] -1.737 0.834
## $ col_2: num -0.902
map_* functions, like map_dbl() in this example, take two key arguments, .x (the data which will be acted on) and .f (the function which will act on this data). Here, .x is the data in col_1, which is a list-column. .f is the function sum(). However, we can not simply write map_dbl(col_1, sum). Instead, each use of map_* functions requires the use of a tilde — a ~ — to indicate the start of the function and the use of a dot — a . — to specify where the data goes in the function.
map_* functions are a family of functions, with the suffix specifying the type of the object to be returned. map() itself returns a list. map_dbl() returns a double. map_int() returns an integer. map_chr() returns a character, and so on.
tibble(ID = 1) %>%
mutate(col_1 = map(ID, ~range(rnorm(10)))) %>%
mutate(col_2 = map_dbl(col_1, ~ sum(.))) %>%
mutate(col_3 = map_int(col_1, ~ length(.))) %>%
mutate(col_4 = map_chr(col_1, ~ sum(.))) %>%
str()## tibble [1 × 5] (S3: tbl_df/tbl/data.frame)
## $ ID : num 1
## $ col_1:List of 1
## ..$ : num [1:2] -1.28 1.07
## $ col_2: num -0.216
## $ col_3: int 2
## $ col_4: chr "-0.216124"
Consider a more detailed example:
# This simple example demonstrates the workflow which we will often follow.
# Start by creating a tibble which will be used to store the results. (Or start
# with a tibble which already exists and to which you will be adding more
# columns.) It is often convenient to get all the code working with just a few
# rows. Once it is working, we increase the number of rows to a thousand or
# million or whatever we need.
tibble(ID = 1:3) %>%
# The big convenience is being able to store a list in each row of the tibble.
# Note that we are not using the value of ID in the call to rnorm(). (That is
# why we don't have a "." anywhere.) But we are still using ID as a way of
# iterating through each row; ID is keeping count for us, in a sense.
mutate(draws = map(ID, ~ rnorm(10))) %>%
# Each succeeding step of the pipe works with columns already in the tibble
# while, in general, adding more columns. The next step calculates the max
# value in each of the draw vectors. We use map_dbl() because we know that
# max() will returns a single number.
mutate(max = map_dbl(draws, ~ max(.))) %>%
# We will often need to calculate more than one item from a given column like
# draws. For example, in addition to knowing the max value, we would like to
# know the range. Because the range is a vector, we need to store the result
# in a list column. map() does that for us automatically.
mutate(min_max = map(draws, ~ range(.)))## # A tibble: 3 x 4
## ID draws max min_max
## <int> <list> <dbl> <list>
## 1 1 <dbl [10]> 1.01 <dbl [2]>
## 2 2 <dbl [10]> 1.04 <dbl [2]>
## 3 3 <dbl [10]> 1.70 <dbl [2]>
This flexibility is only possible via the use of list-columns and map_* functions. This workflow is extremely common. We start with an empty tibble, using ID to specify the number of rows. With that skeleton, each step of the pipe adds a new column, working off a column which already exists.
5.2 Probability distributions

FIGURE 5.1: Dice and Probability.
What does it mean that Trump had a 30% chance of winning re-election in the fall of 2020? That there is a 90% probability of rain today? That the dice at the casino are unfair?
Probability is about quantifying uncertainty. We can think of probability as a proportion. The probability of an event occurring is a number from 0 to 1, where 0 means that the event is impossible and 1 means that the event is 100% certain.
Let’s begin with the simplest events: coin flips and dice rolls. If the dice and the coins are fair, we can operate under the assumption that all outcomes are equally likely.
This allows us to make the following statements:
- The probability of rolling a 1 or a 2 is 2/6, or 1/3.
- The probability of rolling a 1, 2, 3, 4, 5, or 6 is 1.
- The probability of flipping a coin and getting tails is 1/2.
For the purposes of this Primer, a probability distribution is a mathematical object that covers a set of outcomes, where each distinct outcome has a chance of occurring between 0 and 1 inclusive. The chances must sum to 1. The set of possible outcomes — heads or tails for the coin, 1 through 6 for a single die, 2 through 12 for a pair of dice — can be either discrete or continuous. This set of outcomes is the domain of the probability distribution. There are three types of probability distributions: mathematical, empirical, and posterior.
The key difference between a distribution, as we have explored them Section 2.8, and a probability distribution is the requirement that the sum of the probabilities of the individual outcomes must be exactly 1. There is no such requirement in a distribution. But any distribution can be turned into a probability distribution by “normalizing” it, as we will explore. In this context, we will often refer to a distribution which is not (yet) a probability distribution as an “unnormalized” distribution.
Pay attention to notation. Whenever we are talking about a specific probability (represented by a single value), we will use \(\rho\) (the Greek letter “rho” but spoken aloud as “p” by us) with a subscript which specifies the exact outcome of which it is the probability. For instance, \(\rho_h = 0.5\) denotes the probability of getting heads on a coin toss when the coin is fair. \(\rho_t\) — spoken as “PT” or “P sub T” or “P tails” — denotes the probability of getting tails on a coin toss. However, when we are referring to the entire probability distribution over a set of outcomes, we will use \(\text{Prob}()\). For example, the probability distribution of a coin toss is \(\text{Prob}(\text{coin})\). That is, \(\text{Prob}(\text{coin})\) is composed of the two specific probabilities (50% and 50%) mapped from the two values in the domain (Heads and Tails). Similarly, \(\text{Prob}(\text{sum of two dice})\) is the probability distribution over the set of 11 outcomes (2 through 12) which are possible when you take the sum of two dice. \(\text{Prob}(\text{sum of two dice})\) is made up of 11 numbers — \(\rho_2\), \(\rho_3\), …, \(\rho_{12}\) — each representing the unknown probability that the sum will equal their value. That is, \(\rho_2\) is the probability of rolling a 2.
5.2.1 Flipping a coin
A mathematical distribution is based on a mathematical formula. Assuming that the coin is perfectly fair, we should, on average, get heads as often as we get tails.

An empirical distribution is based on data. You can think of this as the probability distribution created by running a simulation. In theory, if we increase the number of coins we flip in our simulation, the empirical distribution will look more and more similar to the mathematical distribution. The probability distribution is the Platonic form. The empirical distribution will often look like the mathematical probability distribution, but it will rarely be exactly the same.
In this simulation, there are 56 heads and 44 tails. The outcome will vary every time we run the simulation, but the proportion of heads to tails should not be too different if this coin is fair.
# We are flipping one fair coin a hundreds times. We need to get the same result
# each time we create this graphic because we want the results to match the
# description in the text. Using set.seed() guarantees that the random results
# are the same each time. We define 0 as tails and 1 as heads.
set.seed(3)
tibble(results = sample(c(0, 1), 100, replace = TRUE)) %>%
ggplot(aes(x = results)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
binwidth = 0.5,
color = "white") +
labs(title = "Empirical Probability Distribution",
subtitle = "Flipping one coin a hundred times",
x = "Outcome\nResult of Coin Flip",
y = "Probability") +
scale_x_continuous(breaks = c(0, 1),
labels = c("Heads", "Tails")) +
scale_y_continuous(labels =
scales::percent_format(accuracy = 1)) +
theme_classic()
A posterior distribution is based on beliefs and expectations. It displays your belief about things you can’t see right now. You may have posterior distributions for outcomes in the past, present, or future.
In the case of the coin toss, the posterior distribution changes depending on your beliefs. For instance, let’s say your friend brought a coin to school and asked to bet you. If the result is heads, you have to pay them $5.
This makes you suspicious. Your posterior distribution would reflect this. You might believe that \(\rho_h\) is 0.95 and \(\rho_t\) is 0.05.

The full terminology is mathematical (or empirical or posterior) probability distribution. But we will often shorten this to just mathematical (or empirical or posterior) distribution. The word “probability” is understood, even if it is not present.
5.2.2 Rolling two dice
Our mathematical distribution tells us that, with a fair dice, the probability of getting 1, 2, 3, 4, 5, and 6 are equal: there is a 1/6 chance of each. When we roll two dice at the same time and sum the numbers, the values closest to the middle are more common than values at the edge because there are more combinations of numbers that add up to the middle values.

We get an empirical distribution by rolling two dice a hundred times, either by hand or with a computer simulation. The result is not identical to the mathematical distribution because of the inherent randomness of the real world and/or of simulation.
# In the coin example, we create the vector ahead of time, and then assigned
# that vector to a tibble. There was nothing wrong with that approach. And we
# could do the same thing here. But the use of map_* functions is more powerful.
set.seed(1)
emp_dist_dice <- tibble(ID = 1:100) %>%
mutate(die_1 = map_dbl(ID, ~ sample(c(1:6), size = 1))) %>%
mutate(die_2 = map_dbl(ID, ~ sample(c(1:6), size = 1))) %>%
mutate(sum = die_1 + die_2) %>%
ggplot(aes(x = sum)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
binwidth = 1,
color = "white") +
labs(title = "Empirical Probability Distribution",
subtitle = "Sum from rolling two dice, replicated one hundred times",
x = "Outcome\nSum of Two Die",
y = "Probability") +
scale_x_continuous(breaks = seq(2, 12, 1), labels = 2:12) +
scale_y_continuous(labels =
scales::percent_format(accuracy = 1)) +
theme_classic()
emp_dist_dice
We might consider labeling the y-axis in plots of empirical distributions as “Proportion” rather than “Probability” since it is an actual proportion, calculated from real (or simulated) data. We will keep it as “Probability” since we want to emphasize the parallels between mathematical, empirical and posterior probability distributions.
The posterior distribution for rolling two dice a hundred times depends on your beliefs. If you take the dice from your Monopoly set, you have reason to believe that the assumptions underlying the mathematical distribution are true. However, if you walk into a crooked casino and a host asks you to play craps, you might be suspicious. In craps, a come-out roll of 7 and 11 is a “natural,” resulting in a win for the “shooter” and a loss for the casino. You might expect those numbers to occur less often than they would with fair dice. Meanwhile, a come-out roll of 2, 3 or 12 is a loss for the shooter. You might also expect values like 2, 3 and 12 to occur more frequently. Your posterior distribution might look like this:

Someone less suspicious of the casino would have a posterior distribution which looks more like the mathematical distribution.
5.2.3 Presidential elections
Now let’s say we are building probability distributions for political events, like a presidential election. We want to know the probability that Democratic candidate wins X electoral votes, where X comes from the range of possible outcomes: 0 to 538. (The total number of electoral votes in US elections since 1964 is 538.)
We can start with a mathematical distribution for X which assumes that the chances of the Democratic candidate winning any given state’s electoral votes is 0.5 and the results from each state are independent.

We know that campaign platforms, donations, charisma, and many other factors will contribute to a candidate’s success. Elections are more complicated than coin tosses. We also know that many presidential elections in history have resulted in much bigger victories or defeats than this distribution seems to allow for.
The empirical distribution in this case could involve looking into past elections in the United States and counting the number of electoral votes that the Democrats won in each. For the empirical distribution, we create a tibble with electoral vote results from past elections. Looking at elections since 1964, we can observe that the number of electoral votes that the Democrats received in each one is different. Given that we only have 15 entries, it is difficult to draw conclusions or make predictions based off of this empirical distribution.
However, this model is enough to suggest that the assumptions of the mathematical probability distribution above do not work for electoral votes. The model assumes that the Democrats have a 50% chance of receiving each of the 538 votes. Just looking at the mathematical probability distribution, we can observe that receiving 13 or 17 or 486 votes out of 538 would be extreme and almost impossible under this mathematical model. However, our empirical distribution tells us that those were real election results.

The posterior distribution of electoral votes is a popular topic, and an area of strong disagreement, among data scientists. Consider this posterior from FiveThirtyEight.

Here is a posterior from the FiveThirtyEight website from August 13, 2020. This was created using the same data as the above distribution, but simply displayed differently. For each electoral result, the height of the bar represents the probability that a given event will occur. However, there are no lablels y-axis telling us what the specific probability of each outcome is. And that is OK! The specific values are not that useful. If we removed the labels on our y-axes, would it matter?

Here is the posterior from The Economist, also from August 13, 2020. This looks confusing at first because they chose to merge the axes for Republican and Democratic electoral votes. We can tell that The Economist was less optimistic, relative to FiveThirtyEight, about Trump’s chances in the election.

These two models, built by smart people using similar data sources, have reached fairly different conclusions. Data science is difficult! There is not one “right” answer. Real life is not a problem set.
 and [here](https://statmodeling.stat.columbia.edu/2020/08/31/problem-of-the-between-state-correlations-in-the-fivethirtyeight-election-forecast/).](05-probability/images/538_versus_Economist.png)
FIGURE 5.2: Watch the makers of these two models throw shade at each other on Twitter! Eliot Morris is one of the primary authors of the Economist model. Nate Silver is in charge of 538. They don’t seem to be too impressed with each other’s work! More smack talk here and here.
There are many political science questions you could explore with posterior distributions. They can relate to the past, present, or future.
- Past: How many electoral votes would Hilary Clinton have won if she had picked a different VP?
- Present: What are the total campaign donations from Harvard faculty?
- Future: How many electoral votes will the Democratic candidate for president win in 2024?
5.2.4 Continuous distributions
The three examples above are all discrete probability distributions, meaning that the outcome variable can only take on a limited set of values. A coin flip has two outcomes. The sum of a pair of dice has 11 outcomes. The total electoral votes for the Democratic candidate has 539 possible outcomes. In the limit, we can also create continuous probability distributions which have an infinite number of possible outcomes. For example, the average height for an American male could be any real number between 0 inches and 100 inches. (Of course, an average value anywhere near 0 or 100 is absurd. The point is that the average could be 68.564, 68.5643, 68.56432 68.564327, or any real number.)
All the characteristics for discrete probability distributions which we reviewed above apply just as much to continuous probability distributions. For example, we can create mathematical, empirical and posterior probability distributions for continuous outcomes just as we did for discrete outcomes. For example, a posterior probability distribution for \(p\), the percentage of US citizens who approve of Biden, might look like:

Comments:
The truth is out there. If we asked all 300+ million Americans whether or not they approve of President Biden, we could know \(p\) exactly. Alas, we can’t do that. We use a posterior probability distribution to summarize are beliefs about the true value of \(p\), a truth we can never confirm.
Continuous variables are a myth. Nothing that can be represented on a computer is truly continuous. Even something which appears continuous, like \(p\), actually can only take on a (very large) set of discrete variables. In this case, there are approximately 300 million possible true values of \(p\), one for each total number of people who approve of President Biden.
The math of continuous probability distributions can be tricky. Read a book on mathematical probability for all the messy details. Little of that matters in applied work.
The most important difference is that, with discrete distributions, it makes sense to estimate the probability of a specific outcome. What is the probability of rolling a 9? With continuous distributions, this makes no sense because there are an infinite number of possible outcomes. With continuous variables, we only estimate intervals. What is the probability that Biden’s true approval percentage is between 50% and 55%?
Don’t worry about the distinctions between discrete and continuous outcomes, or between the discrete and continuous probability distributions which we will use to summarize our beliefs about those outcomes. The basic intuition is the same in both cases.
5.2.5 Working with probability distributions

FIGURE 5.3: Bruno de Finetti, an Italian statistician who wrote a famous treatise on the theory of probability that began with the statement “PROBABILITY DOES NOT EXIST.” This is because probability only exists subjectively in our minds.
A probability distribution is not always easy to work with. It is a complex object. And, in many contexts, we don’t really care about all that complexity. So, instead of providing the full probability distribution, we often just use a summary measure, a number or two or three which captures those aspects of the entire distribution which are relevant to the matter at hand. Let’s explore these issues using the 538 posterior probability distribution, as of August 13, 2020, for the number of electoral votes which will be won by Joe Biden. Here is a tibble with 1,000,000 draws from that distribution:
draws## # A tibble: 1,000,000 x 2
## ID electoral_votes
## <int> <int>
## 1 1 333
## 2 2 352
## 3 3 338
## 4 4 437
## 5 5 346
## 6 6 171
## 7 7 318
## 8 8 210
## 9 9 261
## 10 10 298
## # … with 999,990 more rows
A distribution and a sample of draws from that distribution are different things. But, if you squint, they are sort of the same thing, at least for our purposes. For example, if you want to know the mean of the distribution, then the mean of the draws will be a fairly good estimate, especially if the number of draws is large enough.
Recall from Chapter 2 how we can draw randomly from specified probability distributions:
rnorm(10)## [1] 0.030 0.732 0.091 0.263 -0.100 -1.533 -0.194 0.604 -1.913 0.345
runif(10)## [1] 0.128 0.618 0.975 0.543 0.374 0.920 0.709 0.210 0.048 0.936
The elements of these vectors are all “draws” from the specified probability distributions. In most applied situations, our tools will produce draws rather than summary objects. Fortunately, a vector of draws is very easy to work with. Start with summary statistics:
key_stats <- draws %>%
summarize(mn = mean(electoral_votes),
md = median(electoral_votes),
sd = sd(electoral_votes),
mad = mad(electoral_votes))
key_stats## # A tibble: 1 x 4
## mn md sd mad
## <dbl> <dbl> <dbl> <dbl>
## 1 325. 326 86.9 101.
Calculate a 95% interval directly:
## 2.5% 98%
## 172 483
Approximate the 95% interval in two ways:
c(key_stats$mn - 2 * key_stats$sd,
key_stats$mn + 2 * key_stats$sd)## [1] 152 499
c(key_stats$md - 2 * key_stats$mad,
key_stats$md + 2 * key_stats$mad)## [1] 124 528
In this case, using the mean and standard deviation produces a 95% interval which is closer to the true interval. In other cases, the median and scaled median absolute deviation will do better. Either approximation is generally “good enough” for most work. But, if you need to know the exact 95% interval, you must use quantile().
5.2.6 Unnormalized distributions
Remember that probability distributions are mathematical objects that cover a set of outcomes, where each outcome in the domain is mapped to a probability value between 0 and 1 inclusive and the sum of all mappings is 1. Sometimes, you may see distributions similar to probability distributions, only the y-axis displays raw counts instead of proportions. Unnormalized distributions are not probability distributions, but it is easy to convert between the two. You simply divide all the outcome counts on the y-axis by the sum of all outcome counts to “normalize” the unnormalized distribution. Unnormalized distributions are often an intermediary step; it is sometimes handy to work with counts until the very end.
For instance, we can generate the following unnormalized distribution for the sum of rolling two dice. (This uses the same code as above, but without the normalization step.)

Notice that the shape of the distribution is the same as the empirical probability distribution we generated earlier, except that the y-axis is labeled differently.

The two plots — unnormalized and normalized — have the exact same shape. In many ways, they are the same object. Yet normalization is required if we want to work with a probability distribution.
5.2.7 Joint distributions
Recall that \(\text{Prob}(\text{coin})\) is the probability distribution for the result of a coin toss. It includes two parts, the probability of heads (\(\rho_h\)) and the probability of tails (\(\rho_t\)). This is a univariate distribution because there is only one outcome, which can be heads or tails. If there is more than one outcome, then we have a joint distribution.
Joint distributions are also mathematical objects that cover a set of outcomes, where each distinct outcome has a chance of occurring between 0 and 1 and the sum of all chances must equal 1. The key to a joint distribution is it measures the chance that both events A and B will occur. The notation is \(\text{Prob}(A, B)\).
Let’s say that you are rolling two six-sided dice simultaneously. Die 1 is weighted so that there is a 50% chance of rolling a 6 and a 10% chance of each of the other values. Die 2 is weighted so there is a 50% chance of rolling a 5 and a 10% chance of rolling each of the other values. Let’s roll both dice 1,000 times. In previous examples involving two dice, we cared about the sum of results and not the outcomes of the first versus the second die of each simulation. With a joint distributions, the order matters; so instead of 11 possible outcomes on the x-axis of our distribution plot (ranging from 2 to 12), we have 36. Furthermore, a 2D probability distribution is not sufficient to represent all of the variables involved, so the joint distribution for this example is displayed using a 3D plot.
5.3 Tree diagrams
5.3.1 Independence
So far, you have learned how to explore \(\text{Prob}(A)\), which is the fancy, statistical way of saying the probability distribution for event A. Keep in mind the distinction between an individual outcome from the set \(A\), like the probability of heads (which is \(\rho_h\)) and the entire distribution \(\text{Prob}(\text{coin})\), which includes the probability of each possible outcomes. \(\text{Prob}(A)\) is like \(\text{Prob}(\text{coin})\) or \(\text{Prob}(\text{sum of two dice})\).
What if you flipped two coins in a row, one after the other? You know that the probability of getting heads on the first coin is 1/2. But what are the odds of getting heads two times in a row? Let’s take a look at this tree diagram. We read this diagram from left to right. On the left, the probability of getting heads is 0.5 on the first toss. Then, the tree branches out.
- If we got heads the first time, then we go up the top branch. The probability of getting heads again is 0.5.
- If we got tails the first time, then we go down the bottom branch. The probability of getting heads is 0.5.
Notice how regardless of what we get the first time we flip the coin, the probability of getting heads is 0.5 throughout. Coin flips, in this scenario, are independent. The result of one coin flip does not impact the next coin flip.

5.3.2 Conditional probability
Imagine that 60% of people in a community have a disease. A doctor develops a test to determine if a random person has the disease. However, this test isn’t 100% accurate. There is an 80% probability of correctly returning positive if the person has the disease and 90% probability of correctly returning negative if the person does not have the disease.
The probability of a random person having the disease is 0.6. Since each person either has the disease or doesn’t (those are the only two possibilities), the probability that a person does not have the disease is \(1 - 0.6 = 0.4\).

Now the tree branches out.
- If the random person has the disease, then we go up the top branch. The probability of an infected person testing positive is 0.8 because the test is 80% sure of correctly returning positive when the person has the disease.
- By the same logic, if the random person does not have the disease, we go down the bottom branch. The probability of the person incorrectly testing positive is 0.1.
We decide to go down the top branch if our random person has the disease. We go down the bottom branch if they do not. This is called conditional probability. The probability of testing positive is dependent on whether the person has the disease.
How would you express this in statistical notation? \(\text{Prob}(A|B)\) is the same thing as the probability of A given B. \(\text{Prob}(A|B)\) essentially means the probability of A if we know for sure the value of B. Note that \(\text{Prob}(A|B)\) is not the same thing as \(\text{Prob}(B|A)\).
5.4 Two models
The simplest possible setting for inference involves two models — meaning two possible states of the world — and two outcomes from an experiment. Imagine that there is a disease — Probophobia, an irrational fear of probability — which you either have or don’t have. We don’t know if you have the diseases, but we do assume that there are only two possibilities. We also know that 1% of the people in the population suffer from Probophobia. Woe is them!
We also have a test which is 95% accurate. This is our experiment, one with only two possible outcomes: positive and negative.
Question: If you test positive, what is the probability that you have Probophobia?
More generally, we are estimating a conditional probability. Conditional on the outcome of a postive test, what is the probability that you have Probophobia? Mathematically, we want:
\[ \text{Prob}(\text{Disease | Test = Postive} ) \]
To answer this question, we need to use the tools of joint and conditional probability that we have learned earlier in this Chapter. We begin by building, by hand, the joint distribution of the possible models (you have the Probophobia or you do not) and of the possible outcomes (you test positive or negative). Building the joint distribution involves assuming that each model is true and then creating the distribution of outcomes which might occur if that assumption is true.
For example, assume you have Probophobia. There is then a 95% chance that you test positive and a 5% chance you test negative. Similarly, if we assume that the second model is true — that you don’t have Probophobia — then there is 5% chance you test positive and a 95% you chance negative. Of course, for you (or any individual) we do not know for sure what is happening. We do not know if you have the disease. We do not know what your test will show. But we can use these relationships to construct the joint distribution.
# It is often handy to bring out a key parameter rather than searching for it in
# the code every time you want to change it. When testing this code, we would
# set `sims` to a number like 3, just to make sure everything is working.
sims <- 100000
# Pipes generally start with tibbles, so we start with a tibble which just
# includes an ID variable. We don't really use ID. It is just handy for getting
# organized. We call this object `jd_disease`, where the `jd` standds for
# *j*oint *d*istribution.
jd_disease <- tibble(ID = 1:sims) %>%
# For each ID, we (randomly!) determine whether or not the person has
# Probophobia. In the population, the rate is only 1%, thus explaining the
# `prob` value of 0.01. Note that the call to `rbinom()` does not use the
# value of the `ID` variable.
mutate(have_disease = map_int(ID, ~ rbinom(n = 1,
size = 1,
prob = 0.01))) %>%
# For each person, we (randomly!) decide whether or not their test comes back
# positive. The test is 95% accurate, explaining the use of 0.95 in
# `rbinom()`. Then the hackery begins! rbinom() returns 1 or 0, which
# `ifelse()` treats as TRUE or FALSE. For TRUE, `test_positive` is assigned
# the same value as `have_disease`, which is what we want. The test was
# accurate. This is the single `.`. But 5% of the time, rbinom() will return a
# 0, meaning FALSE, meaning that we assign the opposite value --- generated by
# `! .` --- of `have_disease`. The use of `map_int()` ensures that we get a 0
# or 1 as the return value.
mutate(test_positive = map_int(have_disease,
~ ifelse(rbinom(n = 1,
size = 1,
prob = 0.95),
.,
! .)))
jd_disease## # A tibble: 100,000 x 3
## ID have_disease test_positive
## <int> <int> <int>
## 1 1 0 0
## 2 2 0 0
## 3 3 0 0
## 4 4 0 0
## 5 5 0 0
## 6 6 0 0
## 7 7 0 0
## 8 8 0 1
## 9 9 0 0
## 10 10 1 1
## # … with 99,990 more rows
Most values for have_diseases are zero since 99% of the population does not suffer from Probophobia. Most of those will have a negative test result, meaning test_positive = 0. However, because the test is not perfect, some of those, like ID = 8, will have a “false positive” result. Other individuals, like ID = 10, will have the diseases, and 95% of those will have a “true positive” result, as 10 does.
Plot the joint distribution:
jd_disease %>%
ggplot(aes(x = as.factor(test_positive),
y = as.factor(have_disease))) +
geom_point() +
geom_jitter(alpha = .2) +
labs(title = "Unnormalized Distribution of Test Results and Disease Status",
subtitle = "Rare diseases have many false positives",
x = "Test Result",
y = "Disease Status") +
scale_x_discrete(breaks = c(0, 1),
labels = c("Negative", "Positive")) +
scale_y_discrete(breaks = c(0, 1),
labels = c("Negative", "Positive")) +
theme_classic()
Below is a joint distribution displayed in 3D. Instead of using the “jitter” feature in R to unstack the dots, we are using a 3D plot to visualize the number of dots in each box. The number of people who correctly test negative is far greater than of the other categories. There are far fewer false positives and true positives – and only a handful of false negatives. This is why we can barely see the 3D bar coming from those combinations.
This Section is called “Two Models” because, for each person, there are two possible states of the world: have the disease or not have the disease. By assumption, there are no other outcomes. We call these two possible states of the world “models,” even though they are very simple models.
In addition to the two models, we have two possible results of our experiment on a given person: test positive or test negative. Again, this is an assumption. We do not allow for any other outcome. In coming sections, we will look at more complex situations where we consider more than two models and more than two possible results of the experiment. In the meantime, we have built the unnormalized joint distribution for models and results. This is a key point! Look back earlier in this Chapter for discussions about both unnormalized distributions and joint distributions.
We want to analyze these plots by looking at different slices. For instance, let’s say that you have tested positive for the disease. Since the test is not always accurate, you cannot be 100% certain that you have it. We isolate the slice where the test result equals 1 (meaning positive).
jd_disease %>%
filter(test_positive == 1)## # A tibble: 5,946 x 3
## ID have_disease test_positive
## <int> <int> <int>
## 1 8 0 1
## 2 10 1 1
## 3 19 0 1
## 4 29 0 1
## 5 47 0 1
## 6 51 1 1
## 7 57 0 1
## 8 77 0 1
## 9 78 1 1
## 10 82 0 1
## # … with 5,936 more rows
Most people who test positive do not have the disease! This is a common result for rare diseases. We can easily create an unnormalized conditional distribution with:
jd_disease %>%
filter(test_positive == 1) %>%
ggplot(aes(have_disease)) +
geom_bar() +
labs(title = "Disease Status Given Positive Test",
subtitle = "Rare diseases generates mostly false positives",
x = "Disease Status",
y = "Count") +
scale_x_continuous(breaks = c(0, 1),
labels = c("Healthy", "Infected")) +
theme_classic()
filter() transforms a joint distribution into a conditional distribution.
Turn this unnormalized distribution into a posterior probability distribution:
jd_disease %>%
filter(test_positive == 1) %>%
ggplot(aes(have_disease)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
binwidth = 0.25,
color = "white") +
labs(title = "Posterior for Probophobia Conditional on Positive Test",
x = "Probophobia Status",
y = "Probability") +
scale_x_continuous(breaks = c(0, 1),
labels = c("Healthy", "Infected")) +
scale_y_continuous(labels =
scales::percent_format(accuracy = 1)) +
theme_classic() 
Even with a positive test, you are more than 80% likely to be healthy.
This Stat 110 Animations video does a really good job of explaining similar concepts.
5.5 Three models

Imagine that your friend gives you a bag with two marbles. There could either be two white marbles, two black marbles, or one of each color. Thus, the bag could contain 0% white marbles, 50% white marbles, or 100% white marbles. The proportion, \(p\), of white marbles could be 0, 0.5, or 1.
Let’s say you take a marble out of the bag, record whether it’s black or white, then return it to the bag. You repeat this three times, observing the number of white marbles you see out of three trials. You could get three whites, two whites, one white, or zero whites as a result of this trial. Let’s make what we call a Bayes scatterplot out of this. We have three models (three different proportions of white marbles in the bag) and four possible experimental results. Let’s create 3,000 draws from this joint distribution:
# Create the joint distribution of the number of white marbles in the bag
# (in_bag) and the number of white marbles pulled out in the sample (in_sample),
# one-by-one. in_bag takes three possible values: 0, 1 and 2, corresponding to
# zero, one and two white marbles potentially in the bag.
sims <- 10000
# We also start off with a tibble. It just makes things easier
jd_marbles <- tibble(ID = 1:sims) %>%
# For each row, we (randomly!) determine the number of white marbles in the
# bag. We do not know why the `as.integer()` hack is necessary. Shouldn't
# `map_int()` automatically coerce the result of `sample()` into an integer?
mutate(in_bag = map_int(ID, ~ as.integer(sample(c(0, 1, 2),
size = 1)))) %>%
# Depending on the number of white marbles in the bag, we randomly draw out 0,
# 1, 2, or 3 white marbles in our experiment. We need `p = ./2` to transform
# the number of white marbles into the probability of drawing out a white
# marble in a single draw. That probability is either 0%, 50% or 100%.
mutate(in_sample = map_int(in_bag, ~ rbinom(n = 1,
size = 3,
p = ./2)))
jd_marbles## # A tibble: 10,000 x 3
## ID in_bag in_sample
## <int> <int> <int>
## 1 1 0 0
## 2 2 2 3
## 3 3 0 0
## 4 4 2 3
## 5 5 0 0
## 6 6 0 0
## 7 7 0 0
## 8 8 0 0
## 9 9 2 3
## 10 10 0 0
## # … with 9,990 more rows
Plot the joint distribution:
# The distribution is unnormalized. All we see is the number of outcomes in each
# "bucket." Although it is never stated clearly, we are assuming that there is
# an equal likelihood of 0, 1 or 2 white marbles in the bag.
jd_marbles %>%
ggplot(aes(x = in_sample, y = in_bag)) +
geom_jitter(alpha = 0.5) +
labs(title = "Black and White Marbles",
subtitle = "More white marbles in bag mean more white marbles selected",
x = "White Marbles Selected",
y = "White Marbles in the Bag") +
scale_y_continuous(breaks = c(0, 1, 2)) +
theme_classic()
Here is the 3D visualization:
The y-axes of both the scatterplot and the 3D visualization are labeled “Number of White Marbles in the Bag.” Each value on the y-axis is a model, a belief about the world. For instance, when the model is 0, we have no white marbles in the bag, meaning that none of the marbles we pull out in the sample will be white.
Let’s say we draw out three white marbles. We isolate the slice where the result of the simulation involves three white marbles and zero black ones. Here is the unnormalized probability distribution.
# The key step is the filter. Creating a conditional distribution from a joint
# distribution is the same thing as filtering that joint distribution for a
# specific value. A conditional distribution is a "slice" of the joint
# distribution, and we take that slice with filter().
jd_marbles %>%
filter(in_sample == 3) %>%
ggplot(aes(in_bag)) +
geom_histogram(binwidth = 0.5, color = "white") +
labs(title = "Unnormalized Conditional Distribution",
subtitle = "Number of white marbles in bag given that three were selected in the sample",
x = "Number of White Marbles in the Bag",
y = "Count") +
coord_cartesian(xlim = c(0, 2)) +
scale_x_continuous(breaks = c(0, 1, 2)) +
theme_classic()
Next, let’s normalize the distribution.
jd_marbles %>%
filter(in_sample == 3) %>%
ggplot(aes(in_bag)) +
geom_histogram(aes(y = after_stat(count/sum(count))),
binwidth = 0.5,
color = "white") +
labs(title = "Posterior Probability Distribution",
subtitle = "Number of white marbles in bag given that three were selected in the sample",
x = "Number of White Marbles in the Bag",
y = "Probability") +
coord_cartesian(xlim = c(0, 2)) +
scale_x_continuous(breaks = c(0, 1, 2)) +
scale_y_continuous(labels =
scales::percent_format(accuracy = 1)) +
theme_classic()
This plot makes sense because when all three marbles you draw out of the bag are white, there is a pretty good chance that there are no black marbles in the bag. But you can’t be certain! It is possible to draw three white even if the bag contains one white and one black. However, it is impossible that there are zero white marbles in the bag.
5.6 N models

Assume that there is a coin with \(\rho_h\). We guarantee that there are only 11 possible values of \(\rho_h\): \(0, 0.1, 0.2, ..., 0.9, 1\). In other words, there are 11 possible models, 11 things which might be true about the world. This is just like situations we have previously discussed, except that there are more models to consider.
We are going to run an experiment in which you flip the coin 20 times and record the number of heads. What does this result tell you about the value of \(\rho_h\)? Ultimately, we will want to calculate a posterior distribution of \(\rho_h\), which is written as p(\(\rho_h\)).
To start, it is useful to consider all the things which might happen if, for example, \(\rho_h = 0.4\). Fortunately, the R functions for simulating random variables makes this easy.

First, notice that many different things can happen! Even if we know, for certain, that \(\rho_h = 0.4\), many outcomes are possible. Life is remarkably random. Second, the most likely result of the experiment is 8 heads, as we would expect. Third, we have transformed the raw counts of how many times each total appeared into a probability distribution. Sometimes, however, it is convenient to just keep track of the raw counts. The shape of the figure is the same in both cases.

Either way, the figures show what would have happened if that model — that \(\rho_h = 0.4\) — were true.
We can do the same thing for all 11 possible models, calculating what would happen if each of them were true. This is somewhat counterfactual since only one of them can be true. Yet this assumption does allow us to create the joint distribution of models which might be true and of data which our experiment might generate. Let’s simplify this is p(models, data), although you should keep the precise meaning in mind.

Here is the 3D version of the same plot.
In both of these diagrams, we see 11 models and 21 outcomes. We don’t really care about the p(\(models\), \(data\)), the joint distribution of the models-which-might-be-true and the data-which-our-experiment-might-generate. Instead, we want to estimate \(p\), the unknown parameter which determines the probability that this coin will come up heads when tossed. The joint distribution alone can’t tell us that. We created the joint distribution before we had even conducted the experiment. It is our creation, a tool which we use to make inferences. Instead, we want the conditional distribution, p(\(models\) | \(data = 8\)). We have the results of the experiment. What do those results tell us about the probability distribution of \(p\)?
To answer this question, we simply take a vertical slice from the joint distribution at the point of the x-axis corresponding to the results of the experiment.
This animation shows what we want to do with joint distributions. We take a slice (the red one), isolate it, rotate it to look at the conditional distribution, normalize it (change the values along the current z-axis from counts to probabilities), then observe the resulting posterior.

This is the only part of the joint distribution that we care about. We aren’t interested in what the object looks like where, for example, the number of heads is 11. That portion is irrelevant because we observed 8 heads, not 11. By using the filter function on the simulation tibble we created, we can conclude that there are a total of 465 times in our simulation in which 8 heads were observed.
As we would expect, most of the time when 8 coin tosses came up heads, the value of \(p\) was 0.4. But, on numerous occasions, it was not. It is quite common for a value of \(p\) like 0.3 or 0.5 to generate 8 heads. Consider:

Yet this is a distribution of raw counts. It is an unnormalized density. To turn it into a proper probability density (i.e., one in which the sum of the probabilities across possible outcomes sums to one) we just divide everything by the total number of observations.

The most likely value of \(\rho_h\) is still 0.4, as before. But, it is much more likely that \(p\) is either 0.3 or 0.5. And there is about an 8% chance that \(\rho_h \ge 0.6\).
You might be wondering: what is the use of a model? Well, let’s say we toss the coin 20 times and get 8 heads again. Given this result, what is the probability that future samples of 20 flips will result in 10 or more heads?
There are three main ways you could go about solving this problem with simulations.
The first wrong way to do this is assuming that \(\rho_h\) is certain because we observed 8 heads after 20 tosses. We would conclude that 8/20 gives us 0.4. The big problem with this is that you are ignoring your uncertainty when estimating \(\rho_h\). This would lead us to the following code.
sims <- 10000000
odds <- tibble(sim_ID = 1:sims) %>%
mutate(heads = map_int(sim_ID, ~ rbinom(n = 1, size = 20, p = .4))) %>%
mutate(result = ifelse(heads >= 10, TRUE, FALSE)) %>%
summarize(success = sum(result)/sims)
odds## # A tibble: 1 x 1
## success
## <dbl>
## 1 0.245
The second method involves sampling the whole posterior distribution vector we previously created. This would lead to the following correct code.
p_draws <- tibble(p = rep(seq(0, 1, 0.1), 1000)) %>%
mutate(heads = map_int(p, ~ rbinom(n = 1, size = 20, p = .))) %>%
filter(heads == 8)
odds <- tibble(p = sample(p_draws$p, size = sims, replace = TRUE)) %>%
mutate(heads = map_int(p, ~ rbinom(n = 1, size = 20, p = .))) %>%
mutate(result = ifelse(heads >= 10, TRUE, FALSE)) %>%
summarize(success = sum(result)/sims)
odds## # A tibble: 1 x 1
## success
## <dbl>
## 1 0.328
Third way is to sample from the actual distribution, which is a small dataset with just a few rows, but it includes both \(\rho_h\) and the probability of each \(\rho_h\). This also gives the correct answer.
p_posterior <- jd_coin %>%
filter(heads == 8) %>%
group_by(p) %>%
summarize(total = n(), .groups = "drop") %>%
mutate(probs = total/sum(total))
odds <- tibble(p = sample(p_posterior$p,
size = sims,
prob = p_posterior$probs,
replace = TRUE)) %>%
mutate(heads = map_int(p, ~ rbinom(n = 1,
size = 20,
p = .))) %>%
mutate(result = ifelse(heads >= 10,
TRUE,
FALSE)) %>%
summarize(success = sum(result)/sims)
odds## # A tibble: 1 x 1
## success
## <dbl>
## 1 0.320
As you may have noticed, if you calculated the value using the first method, you would believe that getting 10 or more heads is less likely than it really is. If you were to run a casino based on these assumptions, you will lose all your money. It is very important to be careful about the assumptions you are making. We tossed a coin 20 times and got 8 heads. However, you would be wrong to assume that \(\rho_h\) = 0.4 just based on this result.
5.7 Cardinal Virtues
The four Cardinal Virtues are Wisdom, Justice, Courage, and Temperance. Because data science is, ultimately, a moral act, we use these virtues to guide our work.
Wisdom begins with the ideal Preceptor Table. What data, if we had it, would allow us to answer our question easily? We explore the data we have. Are the data we have close enough to the data we want that we can consider all the data as coming from the same population? If not, we can’t proceed further.
Justice looks at the actual Preceptor Table, a structure which includes a row for every unit in the population – the data we want to have and the data which we actually have. Is our data representative of the population? Is the meaning of the columns consistent, i.e., can we assume validity? We then make an assumption about the data generating mechanism. What general mathematical formula connects the outcome variable we are interested in with the other data that we have?
Courage allows us to explore different models. Even though Justice has provided the basic mathematical structure of the model, we still need to decide which variables to include and to estimate the values of unknown parameters. We avoid hypothesis tests. We check our models for consistency with the data we have. We select one model.
Temperance guides us in the use of the model we have created to answer the questions we began with. We create posteriors of quantities of interest. We should be modest in the claims we make. The posteriors we create are never the “truth.” The assumptions we made to create the model are never perfect. Yet decisions made with flawed posteriors are almost always better than
5.7.1 Wisdom

FIGURE 5.4: Wisdom.
Wisdom helps us decide if we can even hope to answer our question with the data we have.
First, start with the ideal Preceptor Table. What rows and columns of data do you need such that, if you had them all, the calculation of the number of interest would be trivial? If you want to know the average height of an adult in India, then the ideal Preceptor Table would include a row for each adult and a column for their height. With no missing data, the average is easy to determine, as would a wide variety of other estimands, other unknown numbers.
One key aspect of this ideal Preceptor Table is whether or not we need more than one potential outcome in order to calculate our estimand. For example, if we want to know the causal effect of exposure to Spanish-speakers on attitude toward immigration that we need a causal model, one which estimates that attitude under both treatment and control. The Preceptor Table would require two columns for the outcome. If, on the other hand, we only want to predict someone’s attitude, or compare one person’s attitide to another’s, then we would only need a Preceptor Table with one column for the outcome.
Are we are modeling (just) for prediction or are we (also) modeling for causation? Predictive models care nothing about causation. Causal models are often also concerned with prediction, if only as a means of measuring the quality of the model.
Every model is predictive, in the sense that, if we give you new data — and it is drawn from a stable distribution — then you can create a predictive forecast. But only a subset of those models are causal, meaning that, for a given individual, you can change the value of one input and figure out what the new output would be and then, from that, calculate the causal effect.
With prediction, all we care about is forecasting Y given X on some as-yet-unseen data — implying that the Preceptor Table has rows in which we know all the X but not the Y. But there is no notion of “manipulation” in such models. We don’t pretend that, for Joe, we could turn variable X from a value of 5 to a value of 30 by just turning some knob and, by doing so, cause Joe’s value of Y to change from 17 to 23. We can compare two people (or two groups of people), one with X equal to 5 and one with X equal to 30, and see how they differ in Y. The basic assumption of predictive models is that there is only one possible Y for Joe. There are not, by assumption, two possible values for Y, one if X equal 5 and another if X equals 30. The Preceptor Table has a single column under Y.
With causal inference, however, we can consider the case of Joe with \(X = 5\) and Joe with \(X = 30\). The same mathematical model can be used. And both models can be used for prediction, for estimating what the value of Y will be for a yet-unseen observation with values X. But, in this case, instead of only a single column in the Preceptor Table for Y, we have at least two (and possibly many) such columns, one for each of the potential outcomes under consideration.
The difference between prediction models and causal models is that the former have one column for Y and the latter have more than one.
Second, we look at the data we have and perform an exploratory data analysis, an EDA. You can never look at your data too much. The most important variable is the one we most want to understand/explain/predict/manipulate. In the models we create in later chapters. This variable will go on the lefthand side of our mathematical equations. Some academic fields refer to this as the “dependent variable.” Others use terms like “regressor” or “outcome.” Whatever the terminology, we need to explore the distribution of this variable, its min/max/range, its mean and median, its standard deviation.
Gelman, Hill, and Vehtari (2020) write:
Most important is that the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient. Optimally, this means that the outcome measure should accurately reflect the phenomenon of interest, the model should include all relevant predictors, and the model should generalize to the cases to which it will be applied.
For example, with regard to the outcome variable, a model of incomes will not necessarily tell you about patterns of total assets. A model of test scores will not necessarily tell you about child intelligence or cognitive development. …
We care about other variables as well, especially those that are most correlated/connected with the outcome variable. The more time that we spend looking at these variables, the more likely we are to create a useful model.
Third, a key concept is the “population.” We need the data we want, the ideal Preceptor Table, and the data we have to be similar enough that we can consider them as all having come from the same statistical population. From Wikipedia:
In statistics, a population is a set of similar items or events which is of interest for some question or experiment. A statistical population can be a group of existing objects (e.g. the set of all stars within the Milky Way galaxy) or a hypothetical and potentially infinite group of objects conceived as a generalization from experience (e.g. the set of all possible hands in a game of poker).
If we assume that the data we have is drawn from the same population as the data in the ideal Preceptor Table, then we can use information about the former to make inferences about the latter. We can combine the ideal Preceptor Table and the data into a single hypothetical data set. If we can’t do that, if we can’t assume that the two sources come from the same population, then we can’t use our data to answer our questions. We have no choice but to walk away. The heart of Wisdom is knowing when to walk away. As John Tukey noted:
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
5.7.2 Justice

FIGURE 5.5: Justice.
After Wisdom, we have an actual Preceptor Table. It includes rows for the data we have and the data we want to have. It has missing values, most importantly for potential outcomes which were not observed. The central problem in inference is to fill in the question marks in the actual Preceptor Table.
There are three key aspects of Justice: the data generating mechanism, representativeness and validity.
First, the data generating mechanism is the mathematical formula, and associated error term, which relates our outcome variable to our covariates.
Justice requires math. Consider a model of coin-tossing:
\[ H_i \sim B(\rho_H, n = 20) \] The total number \(H\) of Heads in experiment \(i\) with 20 flips of a single coin, \(H_i\), is distributed as a binomial with \(n = 20\) and an unknown probability \(\rho_h\) of the coin coming up Heads.
Note:
This is a cheat and a simplification! We are Bayesians but we have not specified the full Bayesian machinery. We really need priors on the unknown parameter \(\rho_h\) as well. But that is too complex for an introductory class, so we wave our hands, accept the default sensible parameters built into the R packages we use and point readers to more advanced books, like Gelman, Hill, and Vehtari (2020).
Defining \(\rho_h\) as the “the probability that the coin comes up Heads” is a bit of a fudge. If you calculate that by hand and then compare it to what our tools produce, they won’t be the same. Instead, the calculated value will be closer to zero. Why? \(\rho_h\) is really the “long-run percentage of the time the coin comes up Heads.” It is not just the percentage from this experiment.
In this simple case, we are fortunate that the parameter \(\rho_h\) has such a (mostly!) simple analog to a real world quantity. Much of the time, parameters are not so easy to interpret. In a more complex model, especially with one with interaction terms, we focus less on parameters and more on actual predictions.
Wisdom allowed us to assume that all the rows in the actual Preceptor Table come from the same population. If that is true, then it is also true that the mathematical relationship between outcomes and covariates is consistent across these rows.
Second, we should check how “representative” the data we have is of the entire population.
Third, confirm that the data we have accurately capturing the concepts we care about. Is the data valid, given the problem we are trying to solve?
Validity is about the columns in our tibble. Representativeness is about the rows.
5.7.3 Courage

FIGURE 5.6: Courage.
The three languages of data science are words, math and code, and the most important of these is code. We need to explain the structure of our model using all three languages, but we need Courage to implement the model in code.
Courage requires us to take the general mathematical formula provide by Justice and then make it specific. Which variables should we include in the model and which do we exclude? Every data science project involves the creation of several models. For each, we specify the precise data generating mechanism. Using that formula, and some R code, we create a fitted model. All models have parameters. We can never know the true values of the parameters, but we can create, and explore, posterior distributions for those unknown true values.
Code allows us to “fit” a model by estimating the values of the unknown parameters, like \(\rho_h\). Sadly, we can never know the true values of these parameters. But, like all good data scientists, we can express our uncertain knowledge in the form of posterior probability distributions. With those distributions, we can compare the actual values of the outcome variable with the “fitted” or “predicted” results of the model. We can examine the “residuals,” the difference between the fitted and actual values.
Every outcome is the sum of two parts: the model and what is not in the model:
\[outcome = model + what\ is\ not\ in\ the\ model\]
It doesn’t matter what the outcome is. It could be the result of a coin flip, the weight of a person, the GDP of a country. Whatever outcome we are considering is always made up of two parts. The first is the model we have created. The second is all the stuff — all the blooming and buzzing complexity of the real world — which is not a part of the model.
Some of our uncertainty is driven by our ignorance about \(\rho_h\).
A parameter is something which does not exist in the real world. (If it did, or could, then it would be data.) Instead, a parameter is a mental abstraction, a building block which we will use to to help us accomplish our true goal: To replace at least some of the questions marks in the actual Preceptor Table. Since parameters are mental abstractions, we will always be uncertain as to their value, however much data we might collect.
But some, often most, of the uncertainty comes from forces that are, by assumption, not in the model. For example, if the coin is fair, we expect \(T_i\) to equal 10. But, often, it will be different, even if we are correct and \(\rho_h\) equals exactly 0.5.
Some randomness is intrinsic in this fallen world.
5.7.4 Temperance

FIGURE 5.7: Temperance.
There are few more important concepts in statistics and data science than the “Data Generating Mechanism.” Our data — the data that we collect and see — has been generated by the complexity and confusion of the world. God’s own mechanism has brought His data to us. Our job is to build a model of that process, to create, on the computer, a mechanism which generates fake data consistent with the data which we see. With that DGM, we can answer any question which we might have. In particular, with the DGM, we provide predictions of data we have not seen and estimates of the uncertainty associated with those predictions. Courage helped us to create the DGM. Temperance will guide us in its use.
Having created (and checked) a model, we now use the model to answer questions. Models are made for use, not for beauty. The world confronts us. Make decisions we must. Our decisions will be better ones if we use high quality models to help make them.
Sadly, our models are never as good as we would like them to be. First, the world is intrinsically uncertain.

FIGURE 5.8: Donald Rumsfeld.
There are known knowns. There are things we know we know. We also know there are known unknowns. That is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we do not know we do not know. – Donald Rumsfeld
What we really care about is data we haven’t seen yet, mostly data from tomorrow. But what if the world changes, as it always does? If it doesn’t change much, maybe we are OK. If it changes a lot, then what good will our model be? In general, the world changes some. That means that are forecasts are more uncertain that a naive use of our model might suggest.

FIGURE 5.9: Three Card Monte.
What does this mean? Well imagine a crowd playing Three Card Monte in the streets of New York. The guy running the game runs a demo and shows you all the cards to make you confident. They earn money by making you overconfident and persuading you to bet. Your odds may seem good during the demo round, but that doesn’t actually say anything about what will likely happen when the real, high stakes game begins. The person running the game does many simulations, making the “victim” forget that they cannot actually make any conclusions about the odds of winning. There are some variables that we simply do not know even if we put a lot of effort into making posterior probability distributions. People can be using slight of hand, for instance.
We need patience in order to study and understand the unknown unknowns in our data. Patience is also important when we analyze the “realism” of our models. When we created the mathematical probability distribution for presidential elections, for instance, we assumed that the Democratic candidate would have a 50% chance of winning each vote in the electoral college. By comparing the mathematical model to our empirical cases, however, we recognize that the mathematical model is unlikely to be true. The mathematical model suggested that getting fewer than 100 votes is next to impossible, but many past Democratic candidates in the empirical distribution received less than 100 electoral votes.
In Temperance, the key distinction is between the true posterior distribution — what we will call “Preceptor’s Posterior” — and the estimated posterior distribution. Recall our discussion from Section 2.8. Imagine that every assumption we made in Wisdom and Justice were correct, that we correctly understand every aspect of how the world works. We still would not know the unknown value we are trying to estimate — recall the Fundamental Problem of Causal Inference — but the posterior we created would be perfect. That is Preceptor’s Posterior. Sadly, even if our estimated posterior is, very close to Preceptor’s Posterior, we can never be sure of that fact, because we can never know the truth, never be certain that all the assumptions we made are correct.
Even worse, we must always worry that our estimated posterior, despite all the work we put into creating it, is far from the truth. We, therefore, must be cautious in our use of that posterior, humble in our claims about its accuracy. Using our posterior, despite its fails, is better than not using it. Yet it is, as best, a distorted map of reality, a glass through which we must look darkly. Use your posterior with humility.
5.8 Summary
Validity is about the columns in our tibble. Representativeness is about the rows.
Throughout this chapter, we spent time going through examples of conditional distributions. However, it’s worth noting that all probability distributions are conditional on something. Even in the most simple examples, when we were flipping a coin multiple times, we were assuming that the probability of getting heads versus tails did not change between tosses.
We also discussed the difference between empirical, mathematical, and posterior probability distributions. Even though we developed these heuristics to better understand distributions, every time we make a claim about the world, it is based on our beliefs - what we think about the world. We could be wrong. Our beliefs can differ. Two reasonable people can have conflicting beliefs about the fairness of a die.
It is useful to understand the three types of distributions and the concept of conditional distributions, but almost every probability distribution is conditional and posterior. We can leave out both words in future discussions, as we generally will in this book. They are implicit.
If you are keen to learn more about probability, here is a video featuring Professor Gary King. This is a great way to review some of the concepts we covered in this chapter, albeit at a higher level of mathematics.